Abstract

In principle, the design of transition-based 
dependency parsers makes it possible to 
experiment with any general-purpose clas- 
sifier without other changes to the pars- 
ing algorithm. In practice, however, it of- 
ten takes substantial software engineering 
to bridge between the different representa- 
tions used by two software packages. Here 
we present extensions to MaltParser that 
allow the drop-in use of any classifier con- 
forming to the interface of the Weka ma- 
chine learning package, a wrapper for the 
TiMBL memory-based learner to this in- 
terface, and experiments on multilingual 
dependency parsing with a variety of clas- 
sifiers. While earlier work had suggested 
that memory-based learners might be a 
good choice for low-resource parsing sce- 
narios, we cannot support that hypothesis 
in this work. We observed that support- 
vector machines give better parsing perfor- 
mance than the memory-based learner, re- 
gardless of the size of the training set. 

1 Introduction 

Here we present malt-libweka, a library that ex- 
tends MaltParser to allow users to experiment with 
any supervised machine learner compatible with 
the Weka machine learning package. This signif- 
icantly reduces the software engineering effort re- 
quired to integrate new classifiers with MaltParser. 
The Weka distribution comes with many classi- 
fiers, and third-party classifiers may additionally 
provide interfaces to Weka. In the cases where 
they do not, it is fairly straightforward to imple- 
ment an appropriate wrapper so that the package 
in question can be used with Weka, and so now 
also with MaltParser. We have done precisely this 
with TiMBL, the Tilburg Memory-Based Learner, 
and the process is described later in this paper. 



With these extensions to MaltParser, we carried out experiments in
multilingual dependency parsing with a variety of classifiers, following
the CoNLL-X shared task (Buchholz and Marsi, 2006). We also generated
learning curves for each classifier, to see how the different algorithms
would perform with varying training data sizes.
We had considered the hypothesis, as suggested by
earlier work using several general-purpose classifiers for the same NLP
task (Banko and Brill, 2001), that a memory-based learner would provide
better parsing accuracy than MaltParser's default 
SVM and linear classifiers for small training sets, 
but our experiments with the default TiMBL set- 
tings do not support that hypothesis. Instead, we 
found that, absent any particular parameter tuning, 
SVMs gave us the best parsing accuracy for all of 
our experimental settings, for each of the four lan- 
guages in our experiments. 

2 Transition-Based Dependency Parsing 

Transition-based dependency parsers such as 
MaltParser (Nivre et al., 2006a) are popular for a 
number of attractive features. First, in their deter- 
ministic variety, they operate in linear time in the 
length of the input sentence, and so are fast compared with graph-based
or chart-parsing methods, which operate in higher-order polynomial
time (Kübler et al., 2009). Secondly, transition-based methods give
state-of-the-art parsing accuracy in many settings; in the recent CoNLL
shared tasks on multilingual dependency parsing, many of the top-ranking
systems were based on transition-based algorithms, and often used
MaltParser specifically (Buchholz and Marsi, 2006; Nivre et al., 2007).
Of additional interest is that transition-based parsing algorithms have an
isolated classification task and can make use of general-purpose machine
learners to address it. Thus the user of the parser may experiment with
different classification algorithms or parameters while keeping the rest
of the parsing system fixed.

Deterministic transition-based dependency 
parsers come in several varieties, but in general, 
they make a single pass over an input sentence, 
token by token, and build a dependency struc- 
ture as the result of a bounded-length series of 
decisions. At each point in the processing of 
a sentence, the parser is said to be in a given 
configuration, and it must choose which possible 
transition to make in order to proceed to the next 
configuration: eventually, the parser makes its 
way from the initial configuration to a final one, in 
which all of the words in the sentence have been 
processed. This parsing approach is analogous to 
the shift-reduce parsing that one might use in a
constituency parsing setting (Kübler et al., 2009).

Typically, in the initial configuration, an input 
sentence has been loaded into a buffer B, and 
there is an empty stack S, which will have tokens 
pushed on to it and popped off in the subsequent 
transitions. Along the way, dependency arcs are 
formed between words on the front of the buffer 
and the top of the stack, and these are added to 
A, the set of current arcs, which, in the final con- 
figuration, constitutes the dependency parse of the 
sentence. 

There are several possible "transition systems" 
that can be used with transition-based dependency 
parsing, each of which provides a different set 
of transition operations for proceeding through 
the configurations in the derivation of a particular 
parse. Many, but not all, transition systems derive
only projective dependency trees, which is to say
that for any directed arc (w_i, r, w_j) (an arc from
word w_i to word w_j with dependency relation r),
all of the words between w_i and w_j are also either
dependents of w_i or transitively dependent on
it. Thus for projective trees, dependency relations
describe contiguous regions where all of the words 
share the same head, not unlike the constituents 
that one might see in a constituency parsing task. 
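
The projectivity condition above can be checked mechanically. The following is a minimal sketch, not code from MaltParser or malt-libweka; it assumes a tree is given as a list of head indices with an artificial root at position 0, and the function names are our own for illustration.

```python
def dominates(heads, ancestor, node):
    """Return True if `ancestor` is reachable from `node` by following head links."""
    for _ in range(len(heads)):      # bounded walk, so a malformed cycle cannot loop forever
        if node == 0:                # reached the artificial root without meeting `ancestor`
            return False
        node = heads[node]
        if node == ancestor:
            return True
    return False


def is_projective(heads):
    """heads[d] is the head of token d (1-based); heads[0] is an unused placeholder."""
    for dependent in range(1, len(heads)):
        head = heads[dependent]
        low, high = sorted((head, dependent))
        # Every word strictly between the two ends of the arc must depend,
        # directly or transitively, on the head of the arc.
        for between in range(low + 1, high):
            if not dominates(heads, head, between):
                return False
    return True


# "She ate fish quickly": every arc covers only descendants of its head.
projective_tree = [None, 2, 0, 2, 2]
# "A hearing is scheduled on the issue today": "on" attaches to "hearing",
# so that arc spans "is" and "scheduled", which are not dependents of "hearing".
crossing_tree = [None, 2, 3, 0, 3, 2, 7, 5, 4]

print(is_projective(projective_tree), is_projective(crossing_tree))
```

The check is quadratic or worse in the sentence length, which is acceptable for inspecting a treebank but is not how a parser would enforce projectivity.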

A transition system for projective dependency 
trees should be both sound and complete with re- 
spect to the set of projective dependency trees; this 
is to say that every output that can be produced by 
the transition system is in fact a valid projective 
dependency tree (soundness), and that every pro- 
jective dependency tree can be produced by some 
sequence of transitions from the transition system 
(completeness). The soundness and completeness 
proofs for several transition systems are provided
in (Nivre, 2008). A baseline transition system with
only three operations (left-arc, right-arc, and shift)
is described in (Kübler et al., 2009); these three
transitions are sufficient to produce any projective 
dependency tree. The intuition for the complete- 
ness proof is also provided in the book: an algo- 
rithm is provided to map from any projective de- 
pendency tree to a sequence of these three tran- 
sitions, and therefore a transition sequence exists 
such that any particular projective tree could be 
produced by this transition system. 

We have yet to describe the process of how the 
parsing algorithm decides which transition to take, 
out of all possible transitions from the current con- 
figuration. Given an oracle, a parser could make 
optimal decisions about how to best proceed to 
the correct parse; in practice, supervised machine 
learning techniques are used to simulate an oracle. 
The parser has a classifier that has been trained to 
predict, for a given configuration, what the best 
available transition is. The training data for these 
classifiers is produced from a dependency tree- 
bank using an algorithm like the one mentioned 
previously, mapping from parses of sentences to 
sequences of (configuration, transition) pairs. Fea- 
tures are then extracted from the configurations, 
and the classifier is trained to predict the transi- 
tions, given the features. Commonly used features 
include the forms, part-of-speech tags, and depen- 
dency relations associated with the top word of 
the stack or next upcoming word from the buffer, 
though other variations are possible. 

While the techniques described so far only pro- 
duce projective trees, it is often useful, in de- 
scribing the syntax of natural languages, to allow 
non-projective dependency structures with cross- 
ing arcs. Many of the non-projective structures are 
familiar from the constituency-parsing world as 
those that cause difficulties for context-free gram- 
mars. For example, in English, topicalization or 
wh-pronouns (in the case of questions) often make 
the object of a verb appear outside of a contiguous 
range with the rest of the dependents of the verb. 
Non-projective structures are also common in lan- 
guages with more free word order. 

There are at least three different ways to 
produce non-projective dependency trees with a 
transition-based parser. One could use a different 
transition system with extra operations that produce non-projective
trees by moving words from the stack back on to the buffer, as described
in (Kübler et al., 2009), or one could use a modified parsing algorithm
like that of Covington, which makes use of more than one stack
(Covington, 2001). Alternatively, one could use a "pseudo-projective"
approach, where the non-projective structures are converted to
projective ones and annotated in the dependency labels during a
pre-processing step. Then at parse time, the classifier will hopefully
predict the enriched labels when generating projective trees; these
labels include enough information to reconstruct the non-projective
trees. This approach is very effective in practice, and was used for
many of the winning CoNLL-X shared task entries (Buchholz and
Marsi, 2006).



In this work, we are concerned with Nivre's Arc-Eager transition
system, initially described in (Nivre, 2003), which has the operations
shift, left-arc_r, right-arc_r, and reduce. The arc-creating operations
are parameterized by some dependency relation r from the set of possible
dependency relations R, which varies according to the task or treebank
in question. The Arc-Eager transition system, without modifications,
produces only projective dependency trees, but can be used with
pseudo-projective parsing. Arc-Eager modifies earlier systems that did
not have a separate reduce operation, and would eliminate words from
the buffer immediately upon attaching them to their heads, if they
appeared to the right of the head. The Arc-Eager system adds the reduce
operation and thus permits transition sequences in which appropriate
arcs can be created eagerly, with the dependent word being used in
subsequent arcs as well, since its right-arc operation does not
eliminate the dependent word.
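
To make the four operations concrete, the following is a minimal sketch of an Arc-Eager configuration together with a simple static oracle that turns a gold-standard projective tree into a transition sequence. It is an illustration under our own assumptions, not MaltParser's implementation: the class and function names are hypothetical, token 0 is an artificial root that starts on the stack, and the oracle reduces a word once it has its head and all of its gold dependents.

```python
SHIFT, LEFT_ARC, RIGHT_ARC, REDUCE = "shift", "left-arc", "right-arc", "reduce"


class Configuration:
    """Stack, buffer and arc set for a sentence of n_tokens words (1-based)."""

    def __init__(self, n_tokens):
        self.stack = [0]                              # 0 is the artificial root
        self.buffer = list(range(1, n_tokens + 1))
        self.heads = {}                               # dependent -> (head, relation)

    def is_final(self):
        return not self.buffer

    def apply(self, transition, relation=None):
        if transition == SHIFT:
            self.stack.append(self.buffer.pop(0))
        elif transition == LEFT_ARC:                  # buffer front becomes head of stack top
            dependent = self.stack.pop()
            self.heads[dependent] = (self.buffer[0], relation)
        elif transition == RIGHT_ARC:                 # stack top becomes head of buffer front,
            dependent = self.buffer.pop(0)            # and the dependent stays available
            self.heads[dependent] = (self.stack[-1], relation)
            self.stack.append(dependent)
        elif transition == REDUCE:                    # discard a word that already has its head
            self.stack.pop()
        else:
            raise ValueError("unknown transition: %r" % (transition,))


def oracle_transitions(gold_heads, gold_rels):
    """Derive a transition sequence from a gold projective tree.

    gold_heads[i] and gold_rels[i] describe token i; index 0 is a placeholder.
    """
    n = len(gold_heads) - 1
    config = Configuration(n)
    sequence = []
    while not config.is_final():
        top, front = config.stack[-1], config.buffer[0]
        if top != 0 and gold_heads[top] == front:
            step = (LEFT_ARC, gold_rels[top])
        elif gold_heads[front] == top:
            step = (RIGHT_ARC, gold_rels[front])
        elif top in config.heads and all(
                d in config.heads for d in range(1, n + 1) if gold_heads[d] == top):
            step = (REDUCE, None)
        else:
            step = (SHIFT, None)
        sequence.append(step)
        config.apply(*step)
    return sequence, config.heads


# "She ate fish quickly", with illustrative relation labels.
steps, arcs = oracle_transitions([None, 2, 0, 2, 2],
                                 [None, "nsubj", "root", "obj", "advmod"])
for transition, relation in steps:
    print(transition, relation or "")
```

In this example "fish" is attached with right-arc, stays on the stack, and is only removed by a later reduce, which is the behavior that distinguishes Arc-Eager from systems without a separate reduce operation.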

3 MaltParser 

MaltParser is a popular package for transition-based dependency
parsing, developed by Hall, Nilsson and Nivre.¹ MaltParser comes with
implementations of several (nine, as of the current version) transition
systems for dependency parsing; the default is Nivre's Arc-Eager system.
MaltParser also comes with transition systems that can produce
non-projective trees, and pre- and post-processors for
pseudo-projective parsing.

For learning to make transition decisions, MaltParser is packaged with
two classifier libraries, LIBSVM (Chang and Lin, 2011) and LIBLINEAR
(Fan et al., 2008). These packages provide a variety of classification
techniques, including support vector machines with various kernels,
linear support vector machines, and logistic regression. Each of these
classifiers has tunable parameters, such as the type of kernel used for
SVMs, or regularization options for logistic regression. MaltParser
uses SVMs with a polynomial kernel by default; this was the kernel used
by the MaltParser team during both of the CoNLL multilingual dependency
parsing shared tasks. In earlier versions of MaltParser, memory-based
learning with TiMBL was also supported (Nivre et al., 2004), although
this has been removed in the post-1.0 versions of the system, which are
implemented in Java. Prior to version 1.0, MaltParser was written in C.

MaltParser seems to have been designed with 
generality and extensibility in mind; it has
an internal API for integrating arbitrary classifiers,
and much of the program logic has been
pushed into separate XML files and expressed 
declaratively. However, large portions of the Malt- 
Parser code are specific to LIBSVM and LIBLIN- 
EAR, and no documentation about how to add 
more classifier libraries is provided, so researchers 
who wish to experiment with other classifiers have 
a significant software engineering task ahead of 
them. 

4 Weka 



Weka (Hall et al., 2009) is a popular machine learning toolkit for
Java, freely available online.² It includes implementations of a variety
of machine learning algorithms, and each algorithm for a given task
(classification, clustering, etc.) follows a common interface. Weka can
be used either as a stand-alone application or as a library for other
JVM programs, and for any given task or data set, Weka makes it
convenient to experiment with different machine learners and parameters
for those learners. Several third-party machine learning packages also
include wrappers for the Weka interface, allowing them to be plugged in
to any application using the Weka standard. This variety and generality
make Weka a natural fit for transition-based dependency parsing; we
would like to make it possible to try any classification algorithm as a
component of a parsing system.

¹ http://maltparser.org; version 1.7.1 was used in this work.

² http://www.cs.waikato.ac.nz/ml/weka/

One caveat about machine learners included 
with Weka is that they are not necessarily high- 
performance, particularly when compared with 
the implementations of the same algorithms from 
special-purpose packages such as LIBSVM and 
LIBLINEAR; while it was easy, from a perfor- 
mance standpoint, to use the decision-tree and 
Naive Bayes classifiers from Weka, we could not 
get any parses to succeed using Weka's logistic re- 
gression classifier, due to performance problems 
that will be discussed later. But Weka contains 
dozens of other classifiers, and some of them may 
be perfectly suitable for parsing with MaltParser. 

5 The CoNLL-X Shared Task 

In the 2006 CoNLL-X shared task on multilingual dependency parsing
(Buchholz and Marsi, 2006), participants built dependency parsing
systems capable of handling many languages, ideally
with the same parsing algorithm and the same
machine learners, although perhaps with differ- 
ent parameter settings per language. The evalu- 
ation was carried out over thirteen different lan- 
guages, from a variety of language families, al- 
though one (Bulgarian) was optional. The train- 
ing data made available to the participants con- 
tained some non-projective structures, as did the 
gold standard parses for the testing data, though 
the systems were not strictly required to produce 
non-projective parses. 



The CoNLL dependency format (Buchholz and Marsi, 2006) has become a
standard for dependency parsers. CoNLL-formatted parse trees separately
describe each token of a sentence, and can include raw and lemmatized
versions of each word, coarse- and fine-grained part-of-speech tags,
additional lexical features, the head of the token, and the token's
dependency relation to the head. The lexical features present vary for
each language, but they might include information like number, gender,
and case. In practice, these features do not seem to be used by working
parsers; MaltParser comes with feature sets that make use of parts of
speech, dependency relations, and the surface forms of words.
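
As an illustration of the format, the following minimal sketch reads CoNLL-X data into per-sentence token lists. It assumes the usual ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), blank lines between sentences, and underscores for missing values; the function name and the sample sentence are our own and are not drawn from any treebank.

```python
from collections import namedtuple

FIELDS = ["id", "form", "lemma", "cpostag", "postag",
          "feats", "head", "deprel", "phead", "pdeprel"]
Token = namedtuple("Token", FIELDS)


def read_conll(lines):
    """Yield one list of Token records per sentence."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():                 # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        token = Token(*line.split("\t")[:10])
        token = token._replace(
            id=int(token.id),
            # HEAD may be "_" in unannotated test data
            head=None if token.head == "_" else int(token.head),
            # FEATS is a "|"-separated list, or "_" when empty
            feats=[] if token.feats == "_" else token.feats.split("|"),
        )
        sentence.append(token)
    if sentence:                             # file may not end with a blank line
        yield sentence


# Invented two-token example, for illustration only.
SAMPLE = (
    "1\tShe\tshe\tPRON\tPRP\tnum=sg|pers=3\t2\tnsubj\t_\t_\n"
    "2\tsleeps\tsleep\tVERB\tVBZ\t_\t0\troot\t_\t_\n"
    "\n"
)

for sent in read_conll(SAMPLE.splitlines()):
    print([(tok.form, tok.head, tok.deprel) for tok in sent])
```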

The participants presented systems based on a variety of techniques,
but the best systems used either transition-based dependency parsing
or graph-based strategies like those of McDonald's MSTParser (McDonald
et al., 2005). Among the transition-based systems, the highest-scoring
parsers used pseudo-projective strategies, deterministic parsing
algorithms, and support vector machines with polynomial kernels.

6 Experiments

To evaluate a variety of classifiers, we replicated the CoNLL-X (2006)
shared task on multilingual dependency parsing. In the interests of
reproducibility and frugality, we ran experiments only on the languages
with freely available treebanks.³ Out of the thirteen languages in the
evaluation, this leaves Danish, Dutch, Portuguese and Swedish.

We did almost no parameter tuning or feature engineering, save making
sure that the task could be run in 8 gigabytes of RAM. We used a feature
set already present in MaltParser (the one used for parsing with the
Arc-Eager transition system and LIBSVM) and default settings for each
software package to get an initial sense for each classifier's behavior.
There are almost certainly classifier settings and feature sets that
would provide better parsing accuracy, but finding those is left to
interested parties in the future.

Additionally, we did not use pseudo-projective post-processing, since
the goal of the experiments was simply to compare the available
classifiers. The best CoNLL-X entries performed rather better than the
parsers trained in these experiments, and this is due in part to their
handling of non-projective structures, which is described in (Nivre
et al., 2006b).



We also generated learning curves for the parsing task, varying the
amount of training data given to the classifiers in increments of a
thousand sentences, from one thousand sentences up to eleven thousand,
which was roughly the entire training set for two of the four
languages. For Danish, however, the training set was only 5190
sentences long, so the experiments for that language cut off at six
repetitions, and for Portuguese, the training set contains 9071
sentences, so the tenth and eleventh iterations were the same. In all
cases, we show the labeled attachment score curves in Figure 1. The
unlabeled attachment scores varied similarly, and of course were
higher; some of these are given in Figure 4. In Figure 2 we also show
the best available CoNLL-X labeled attachment scores for comparison.

³ The freely-available data for the CoNLL-X shared task is online at
http://ilk.uvt.nl/conll/free_data.html

Figure 1: LAS learning curves for the four languages and various classifiers.

Figure 2: Labeled attachment score for the various languages and classifiers, at the maximum training
corpus size, with the winning CoNLL-X scores, for comparison.

We report results for six different classifiers,
which are listed in Figure 3. Three of the classification
approaches that we tested, LIBSVM with
a polynomial kernel, LIBLINEAR's linear sup- 
port vector machines, and LIBLINEAR's logis- 
tic regression (also known as a "Maximum En- 
tropy" classifier), are available by default with 
MaltParser. The remaining three made use of 
the malt-libweka interface; J48 decision trees and 
the Naive Bayes classifier are familiar algorithms 
that have implementations in Weka. The TiMBL 
memory-based learner also made use of the malt- 
libweka interface, through a new wrapper that was 
implemented for these experiments. 

Looking at the results, we see that, in all cases, 
the SVM classifiers outperform the other classi- 
fiers, typically followed by logistic regression, de- 
cision trees, and TiMBL. Perhaps unsurprisingly, 
Naive Bayes does not give good parsing results on 
these tasks, and gave the worst performance for all 
settings; the features in a parsing task are not mu- 
tually independent. At a few points in the learning 
curves, logistic regression is met or outperformed 
by TiMBL or decision trees, but at all of the points 
along the curves, the SVM classifiers perform sev- 
eral LAS points better than the next-best classifier. 
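
For reference, the two metrics discussed here can be computed as in the following minimal sketch. It assumes each token is represented as a (head, relation) pair and counts every token equally; an official evaluation script may treat some tokens, such as punctuation, differently, so this is an approximation rather than the scorer used for the reported numbers. The function name is our own.

```python
def attachment_scores(gold, predicted):
    """Return (LAS, UAS) as fractions, given parallel lists of (head, relation) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must describe the same tokens")
    total = len(gold)
    # UAS: the predicted head is correct, whatever the relation label.
    unlabeled = sum(1 for g, p in zip(gold, predicted) if g[0] == p[0])
    # LAS: both the head and the relation label are correct.
    labeled = sum(1 for g, p in zip(gold, predicted) if g == p)
    return labeled / total, unlabeled / total


gold = [(2, "det"), (0, "root"), (2, "obj")]
pred = [(2, "det"), (0, "root"), (2, "nmod")]   # right head, wrong label on the last token
las, uas = attachment_scores(gold, pred)
print("LAS = %.3f, UAS = %.3f" % (las, uas))
```

Because a labeled match implies an unlabeled one, LAS can never exceed UAS, which is why the unlabeled curves sit above the labeled ones.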

We did observe, however, that across all four 
languages, the linear SVMs outperformed the 
polynomial-kernel SVMs when trained with the 
smallest corpora. Also, in parsing Swedish, the 
linear SVMs were consistently slightly better than 
the polynomial-kernel ones. The higher perfor- 
mance of the linear SVM on the smaller data sets 
could be explained by its higher bias, which is 
to say that, while it is less able to express com- 
plex hypotheses in higher-dimensional spaces, this 
makes it less likely to over-fit small training sets, 
so the result is not entirely surprising. 

Our confidence that the implementation of malt-libweka is basically
correct, and not the source of the lower performance of the other
classifiers, stems from comparing the performance of the J48 and TiMBL
classifiers with that of the logistic regression setting for LIBLINEAR;
they performed comparably to this major mode of operation for
LIBLINEAR, in some cases equalling or outperforming it. It seems that
support vector machines are simply a good choice for parsing tasks.



7 Software 

One contribution of this work is a reusable package,
malt-libweka, which is freely available online.⁴
malt-libweka itself is a library that works
with MaltParser; its repository includes scripts 
that can be used to reproduce the results in this 
paper, or could be easily modified to do further 
parsing experiments on treebanks in the CoNLL 
format. 

While MaltParser is open-source and de- 
signed for extensibility, the development of malt- 
libweka took non-trivial software engineering ef- 
fort, largely due to a lack of documentation of 
the internals of MaltParser, which was perhaps 
not implemented with the convenience of third- 
party developers in mind. While MaltParser ap- 
parently has a plugin system, both mentioned in 
the online documentation and present in the source 
code, it was non-obvious how to use it, and we 
could not find examples of it being used in prac- 
tice. Whether or not the plugin system is in a us- 
able state, MaltParser's source tree definitely con- 
tains substantial amounts of "dead code" with mis- 
leading names. Particularly, while the "LibSvm" 
and "LibLinear" classes are instrumental in Malt- 
Parser's interfaces to the corresponding machine 
learning packages, MaltParser also contains the 
classes "Libsvm" and "Liblinear" - note the cap- 
italization differences. The latter two seem to be 
entirely vestigial, and not called in the current ver- 
sion. 

Hopefully the use of malt-libweka will save future developers from
having to delve too much into the source of MaltParser when they would
like to experiment with MaltParser and different classifiers. With
malt-libweka, users need only adapt their machine learners to the
interface used in Weka; a straightforward example of this is provided
in the maltparser.TimblClassifier class. For this use case, the only
required methods in the interface are buildClassifier and
classifyInstance, which, respectively, train a classifier given a set
of training instances and return a predicted class for a given
instance. The existing MaltParser code, coupled with malt-libweka,
handles the rest of the process, including extracting the relevant
features from parse configurations and then making those features
available to the machine learner, both at training and parsing time.

⁴ http://github.com/alexrudnick/

Figure 3: The six different classifiers used in experiments.

Figure 4: Scores for different classifiers (rounded, as a percentage), on each of the four languages,
Danish, Dutch, Portuguese, and Swedish. The designation (sm) is for the smallest training set that was
tried with the given language, and (lg) indicates the largest. In all cells of the table, the first number is
the LAS, and the second is the UAS.

7.1 Implementing the TiMBL-Weka Interface

TiMBL, described in detail in (Daelemans et al., 2010),
is a package for memory-based learning,
and is freely available online.⁵ As a lazy learner,
TiMBL's training process consists of storing all of 
the examples that it is given for use at classifica- 
tion time, where the values of the features of these 
examples could be treated in a symbolic, nominal 
way, or as numbers over which there is an order- 
ing or a distance. At an implementation level, the 
TiMBL classifier can be run as a server with the 
timblserver package, making it possible to inter- 
face with TiMBL from a program written in any 
language that has a networking library. 
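
The idea of a lazy learner over symbolic features can be sketched in a few lines. The following is a deliberately simplified 1-nearest-neighbor classifier with an overlap distance (the number of mismatching feature values); it is not TiMBL's algorithm, which offers feature weighting and other options, and it is not the malt-libweka wrapper. The method names merely echo Weka's buildClassifier and classifyInstance, and the feature values in the example are invented.

```python
class MemoryBasedClassifier:
    """Stores every training example and classifies by the closest stored one."""

    def __init__(self):
        self.examples = []

    def build_classifier(self, instances):
        # "Training" for a lazy learner is just remembering the examples.
        self.examples = [(tuple(features), label) for features, label in instances]

    def classify_instance(self, features):
        if not self.examples:
            raise ValueError("classifier has not been trained")

        def overlap_distance(stored):
            # Values are symbols: they either match or they do not.
            return sum(1 for a, b in zip(stored, features) if a != b)

        # min() keeps the earliest stored example when distances tie.
        _, label = min(self.examples, key=lambda ex: overlap_distance(ex[0]))
        return label


# Invented (stack-top POS, buffer-front POS) -> transition examples.
classifier = MemoryBasedClassifier()
classifier.build_classifier([
    (("NN", "VB"), "left-arc"),
    (("DT", "NN"), "shift"),
    (("VB", "NN"), "right-arc"),
])
print(classifier.classify_instance(("DT", "NN")))
print(classifier.classify_instance(("NN", "JJ")))
```

Because the distance only asks whether two values are equal, integer-coded features are treated as unordered symbols here, which is the nominal interpretation.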

In implementing the Weka wrapper for TiMBL, we had to implement the
code for training time, and for classification time. For the training
procedure, we serialize all of the training instances to a file
readable by the TiMBL software, which uses, roughly, a CSV format.
Then, when it is time to parse, we need to be able to classify new
instances, so the Java program opens a connection to the timblserver
(which must be started by some external program, in this case the
scripts that manage the parsing experiments), serializes a new
instance, and sends it across the network. Then TiMBL sends back its
classification result, and we must, on the Java side, reinterpret the
result as a number for MaltParser's consumption. The implementation of
the Weka wrapper for TiMBL took roughly 100 lines of Java, most of
which manages the network connection.

⁵ http://ilk.uvt.nl/timbl/

8 Discussion 

An issue that we encountered during develop- 
ment was that we had to maintain the meaning of 
the representations used internally by MaltParser, 
when they were passed to Weka classifiers. At an 
early stage of processing, MaltParser builds sev- 
eral vocabularies, mapping from tokens and the 
provided lexical features (POS tags, etc) to inte- 
gers, so that the system need not pass around large 
strings. So sensibly, both at training time and at 
parsing time, the classifiers called by MaltParser 
are presented with integers rather than strings. 
These should not be treated as anything other than unique identifiers,
but it would be fairly easy for a programmer to allow a classifier
trained on these numbers to interpret them as ordinal numbers or take
distances over them. During early development of the system, we made
exactly this mistake; we discovered the problem when inspecting the
decision trees learned by Weka's J48 classifier, which was making
comparisons with a less-than operator. J48 is a Java reimplementation
of the C4.5 algorithm (Quinlan, 1993), which will try to do comparisons
over ordinal numbers given the opportunity. With this in mind, we made
Weka interpret the features passed to it as nominal features (although
they are still represented as numbers), which prevents order-based
comparisons.

However, the logistic regression algorithm, in a mathematical sense,
is defined in terms of numeric feature values. If the Weka
implementation is given nominal attributes, it will binarize them into
a larger number of binary attributes. In this process, an attribute
that has n possible values is transformed into n different binary
attributes. For many features passed to the learner during the parsing
task, there are many thousands of possible values; for the feature
"which word is on the top of the stack", for example, any word in the
vocabulary could appear.
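
The blow-up described above is easy to see in a small example. The following minimal sketch binarizes one instance of integer-coded nominal attributes; it is our own illustration and not the code of Weka or of maltlibweka.FastBinarizer. The function names are hypothetical, and each attribute value is assumed to be an integer in the range 0 to n-1.

```python
def active_indices(instance, n_values):
    """Map nominal attribute values to positions in the binarized attribute space.

    instance[j] is the integer code of attribute j, with 0 <= instance[j] < n_values[j].
    Returns (indices of the binary attributes that are 1, total number of binary attributes).
    """
    offsets = [0]
    for n in n_values:
        offsets.append(offsets[-1] + n)      # attribute j occupies offsets[j] .. offsets[j+1]-1
    indices = []
    for j, value in enumerate(instance):
        if not 0 <= value < n_values[j]:
            raise ValueError("value %d out of range for attribute %d" % (value, j))
        indices.append(offsets[j] + value)
    return indices, offsets[-1]


def dense_vector(instance, n_values):
    """The same encoding as an explicit 0/1 list; wasteful when n_values is large."""
    indices, width = active_indices(instance, n_values)
    vector = [0] * width
    for i in indices:
        vector[i] = 1
    return vector


# Two attributes with 3 and 2 possible values become 3 + 2 = 5 binary attributes.
print(active_indices([2, 0], [3, 2]))
print(dense_vector([2, 0], [3, 2]))

# With vocabulary-sized attributes the dense form has tens of thousands of columns,
# while the sparse form still stores one index per original attribute.
sparse, width = active_indices([12345, 7], [30000, 50])
print(len(sparse), width)
```

Storing only the active indices keeps each instance as small as the original, which is one way a binarizer can avoid exhausting memory; whether a given learner can consume such a sparse form is a separate question.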

During development, we ran across a few 
surprising performance problems. The bi- 
narization code in Weka is much less effi- 
cient than it could be, and while trying to 
parse some of the smaller datasets, the sys- 
tem would run out of memory during binariza- 
tion, even when given 8 gigabytes of RAM. 
This seemed surmountable, so we implemented 
a more efficient version of feature binarization 
(maltlibweka.FastBinarizer), in hopes
that this would let us experiment with Weka's lo- 
gistic regression. But the training times for Weka's 
Logistic class ended up being unbearably long 
and prohibitively memory-intensive when given 
large numbers of binary features, so we also tried a 
few approaches for feature selection, though were 
not successful in this regard. In the end, we gave 
up on Weka's logistic regression implementation, 
although we had hoped to compare it to the one in 
LIBLINEAR. 

So while any given classifier may not perform well in terms of parsing
accuracy, or even computational efficiency (as we have seen in the
course of this work), malt-libweka makes it straightforward to try new
classifiers and new parameters for those classifiers on parsing tasks.
To adapt a new classifier to work with Weka and thus malt-libweka, the
programmer need only provide a method that trains the classifier given
a set of training instances and another that classifies a given
instance after training, barring mishaps with the classifier not
scaling well to the parsing task.

9 Conclusions and Future Work 

We have introduced extensions to MaltParser that 
enable experimentation with different classifiers 
for transition-based dependency parsing, mak- 
ing such experiments straightforward in practice, 
when previously they were only straightforward 
in theory. We have also presented experiments 
with six classifiers for a standard multilingual de- 
pendency parsing task, including varying the size 
of the training set. We were not able to support 
the hypothesis that memory-based learners pro- 
vide better parsing accuracy than support-vector 
machines in low-resource settings; in fact, for settings
with small training sets as well as those with
comparatively large ones, support vector machines 
continue to perform the best out of the approaches 
considered. Our results also suggest that for small 
training sets, linear support vector machines are a 
good choice. 

There may well be classifiers, or parameter set- 
tings for the algorithms, that learn better parsers 
for training sets of these sizes for these languages. 
There may also be better feature sets, perhaps 
making use of agreement information for morpho- 
logically rich languages, and those with more free 
word order. Finding out which parameters and 
which settings is, however, left to future work. 
Hopefully malt-libweka will make these experi- 
ments easy to carry out. 



References 

[Banko and Brill2001] Michele Banko and Eric Brill.
2001. Scaling to Very Very Large Corpora for Natural
Language Disambiguation. In Proceedings of
39th Annual Meeting of the Association for Computational
Linguistics, pages 26-33, Toulouse, France,
July. Association for Computational Linguistics.

[Buchholz and Marsi2006] Sabine Buchholz and Erwin
Marsi. 2006. CoNLL-X Shared Task on Multilingual
Dependency Parsing. In Proceedings of the
Tenth Conference on Computational Natural Language
Learning (CoNLL-X), pages 149-164, New
York City, June. Association for Computational Linguistics.

[Chang and Lin2011] Chih-Chung Chang and Chih-Jen
Lin. 2011. LIBSVM: A library for support
vector machines. ACM Transactions on Intelligent
Systems and Technology, 2:27:1-27:27.
Software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm

[Covington2001] Michael A. Covington. 2001. A
fundamental algorithm for dependency parsing. In
Proceedings of the 39th Annual ACM Southeast
Conference, pages 95-102.

[Daelemans et al.2010] W. Daelemans, J. Zavrel,
K. van der Sloot, and A. van den Bosch. 2010.
TiMBL: Tilburg Memory Based Learner, version
6.3, Reference Guide, ILK Technical Report 10-03.

[Fan et al.2008] Rong-En Fan, Kai-Wei Chang, Cho-Jui
Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008.
LIBLINEAR: A Library for Large Linear Classification.
Journal of Machine Learning Research,
9:1871-1874.

[Hall et al.2009] Mark Hall, Eibe Frank, Geoffrey
Holmes, Bernhard Pfahringer, Peter Reutemann, and
Ian H. Witten. 2009. The WEKA data mining software:
an update. SIGKDD Explorations, 11(1):10-18.

[Kübler et al.2009] Sandra Kübler, Ryan T. McDonald,
and Joakim Nivre. 2009. Dependency Parsing.
Synthesis Lectures on Human Language Technologies.
Morgan & Claypool Publishers.

[McDonald et al.2005] Ryan McDonald, Fernando
Pereira, Kiril Ribarov, and Jan Hajič. 2005.
Non-projective dependency parsing using spanning tree
algorithms. In Proceedings of Human Language
Technology Conference and Conference on Empirical
Methods in Natural Language Processing, pages
523-530, Vancouver, British Columbia, Canada,
October. Association for Computational Linguistics.

[Nivre et al.2004] Joakim Nivre, Johan Hall, and Jens
Nilsson. 2004. Memory-Based Dependency Parsing.
In Hwee Tou Ng and Ellen Riloff, editors,
HLT-NAACL 2004 Workshop: Eighth Conference
on Computational Natural Language Learning
(CoNLL-2004), pages 49-56, Boston, Massachusetts,
USA, May 6 - May 7. Association for
Computational Linguistics.

[Nivre et al.2006a] Joakim Nivre, Johan Hall, and Jens
Nilsson. 2006a. MaltParser: A data-driven
parser-generator for dependency parsing. In Proc. of
LREC-2006, pages 2216-2219.

[Nivre et al.2006b] Joakim Nivre, Johan Hall, Jens
Nilsson, Gülşen Eryiğit, and Svetoslav Marinov.
2006b. Labeled Pseudo-Projective Dependency
Parsing with Support Vector Machines. In Proceedings
of the Tenth Conference on Computational
Natural Language Learning (CoNLL-X), pages 221-225,
New York City, June. Association for Computational
Linguistics.

[Nivre et al.2007] Joakim Nivre, Johan Hall, Sandra
Kübler, Ryan McDonald, Jens Nilsson, Sebastian
Riedel, and Deniz Yuret. 2007. The CoNLL 2007
Shared Task on Dependency Parsing. In Proceedings
of the CoNLL Shared Task Session of EMNLP-CoNLL
2007, pages 915-932, Prague, Czech Republic,
June. Association for Computational Linguistics.

[Nivre2003] Joakim Nivre. 2003. An Efficient Algorithm
for Projective Dependency Parsing. In Proceedings
of the 8th International Workshop on Parsing
Technologies (IWPT), pages 149-160.

[Nivre2008] Joakim Nivre. 2008. Algorithms for
deterministic incremental dependency parsing.
Computational Linguistics, 34(4):513-553, December.

[Quinlan1993] J. Ross Quinlan. 1993. C4.5: programs
for machine learning. Morgan Kaufmann Publishers
Inc., San Francisco, CA, USA.



